Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Oct 21, 2025

Which issue does this PR close?

Rationale for this change

This PR preserves the typed_value's Field metadata. This way, we can check for extension types.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Oct 21, 2025
value: Option<&BinaryViewArray>,
index: usize,
) -> Variant<'a, 'a> {
let data_type = typed_value.data_type();
let (_typed_value_field, typed_value_column) = typed_value;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the field information, we can check for extension types like: e21dc1b

@friendlymatthew
Copy link
Contributor Author

friendlymatthew commented Oct 21, 2025

cc @scovich @alamb @klion26

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a stab at this. I'm not immediately sure how to react, so I just left a couple high level comments while I stew on it more.

Comment on lines 401 to 403
.typed_value_field()
.unwrap()
.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just glancing at this code in isolation, I would guess that .0 is the field and .1 is the array?
But that requires knowing context (ie of this PR)

If this usage will show up often, is it worth returning a newtype instead of a tuple?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes , please, let's return a type with real field names and documentation

Comment on lines 297 to 308
let typed_value = if let Some(typed_value_array) = typed_value.clone() {
let field_ref = Arc::new(Field::new(
"typed_value",
typed_value_array.data_type().clone(),
true,
));
builder = builder.with_field(field_ref.clone(), typed_value_array.clone());

Some((field_ref, typed_value_array))
} else {
None
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is just Option::map (no ? to complicate things)

Comment on lines 660 to 668
let typed_value = if let Some(typed_value_array) = typed_value.clone() {
let field_ref = Arc::new(Field::new(
"typed_value",
typed_value_array.data_type().clone(),
true,
));
builder = builder.with_field(field_ref.clone(), typed_value_array.clone());

Some((field_ref, typed_value_array))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of the dual typing (one in field and one in array), this code is technically no longer infallible -- the data types could disagree. Not sure the best way to address that issue as long as we're passing a full-blown field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this logic is duplicated with the variant array constructor above.

@friendlymatthew
Copy link
Contributor Author

Thanks for taking a stab at this. I'm not immediately sure how to react, so I just left a couple high level comments while I stew on it more.

I think one thing I'm debating is the coupling between the typed_value array and field. If field is an optional metadata, then maybe we can split array and field into separate optional fields.

Comment on lines 401 to 403
.typed_value_field()
.unwrap()
.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes , please, let's return a type with real field names and documentation

@@ -708,6 +730,9 @@ impl From<ShreddedVariantFieldArray> for StructArray {
}
}

pub type TypedValue = (FieldRef, ArrayRef);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combination of an Array (which only has a DataType) and a Field (which has the DataType and metadata information) has come up recently upstream as @paleolimbot is working extension types through DataFusion as well in

We are still looking for a good pattern to use.

I wonder if something like this is too crazy

/// A value that has associated Arrow field infomration
struct TypedValue<T> {
  value: T, // could be ArrayRef, could be ScalarValue
  field: FieldRef,
}

🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably struct TypedArrayRef { value: ArrayRef, field: FieldRef } is best for now (keeping the FieldRef-is-a-type bit opaque). Using FieldRefs as types in DataFusion has mostly led to a lot of cloned metadata and heterogeneity with respect to how the name/metadata/nullability should propagate (or not) but it does seem to be the path of least resistance.

Check for field metadata when accessing typed value
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/drill-field branch from 091eb35 to d96ef3d Compare November 17, 2025 18:43
Copy link
Contributor Author

@friendlymatthew friendlymatthew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +769 to +781
// note: these methods below make me want to impl Array for TypedArrayRef...
pub fn slice(&self, offset: usize, length: usize) -> Self {
let Self { inner, field } = self;

Self {
inner: inner.slice(offset, length),
field: Arc::clone(field),
}
}

pub fn is_valid(&self, index: usize) -> bool {
self.inner.is_valid(index)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what people think. Since TypedArrayRef owns the inner ArrayRef, it'll be pretty easy to get TypedArrayRef to also impl Array.

I don't have a strong opinion, but it feels weird to rewrite methods that exist on a trait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if we want to find a better home for this struct. I suspect a TypedArrayRef would be useful in various parts of the library, not just the variant-related work 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely solves the problem of writing APIs that need to propgagate extension metadata without constantly pairing a FieldRef. It may be worth starting with the internal struct and the methods like you have here and moving to a trait when there's a case that arises elsewhere that also needs that concept? (I'm not very good at predicting where things should go inside the arrow-rs crates)

Comment on lines +795 to +800
impl PartialEq for TypedArrayRef {
#[allow(clippy::op_ref)]
fn eq(&self, other: &Self) -> bool {
&self.inner == &other.inner && self.field == other.field
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was forced to add references to both operands (removing it fails to compile). Spent a lot of time this morning trying to fix/investigate this issue

Weirdly, using a tuple like I did in a prior iteration of this PR would compile 🤔

For those curious:

Here's a repro: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=c868bf8f2f8e2b699f9a84b55a89cd71

And here's a more reduced version: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=224ffe1076c24e1c8db5d4a8f53ca2fc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have vague memories of hitting head-scratchy bugs in comparison-wielding code like this, and was informed there are actually rustc bugs involved. Might it be rust-lang/rust#31740?

Comment on lines +1777 to +1779

#[test]
fn test_fixed_size_binary_without_uuid_extension() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a test that will err when trying to access a FixedSizeBinary(16) without a Uuid extension type

Comment on lines +1804 to +1805
#[test]
fn test_fixed_size_binary_with_uuid_extension() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a test case that will pass since we properly annotated the Uuid extension type to its field metadata

@friendlymatthew
Copy link
Contributor Author

@scovich @klion26 wdyt?

@scovich
Copy link
Contributor

scovich commented Nov 19, 2025

I hope to get to this soon, but am a bit buried at work this week and last...

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a lot of potentially unnecessary churn?
If we can't eliminate it with Deref etc, might we consider a separate PR that separates the churn from functional changes?

Comment on lines +769 to +770
// note: these methods below make me want to impl Array for TypedArrayRef...
pub fn slice(&self, offset: usize, length: usize) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we could actually leverage Array::slice because it returns Arc<dyn Array>, which cannot be converted to TypedArrayRef?

But we could potentially impl Deref for TypedArrayRef with Target = dyn Array, to cover e.g. is_valid and the various array-casting calls in the shredding code?

Comment on lines -942 to +1075
let data_type = typed_value.data_type();
let TypedArrayRef {
inner: typed_value_array,
field: typed_value_field,
} = typed_value;

let data_type = typed_value_array.data_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of all this churn, an impl Deref could make it transparent?
(see other comment)

Comment on lines +1090 to +1091
typed_value_field.try_extension_type(),
Ok(arrow_schema::extension::Uuid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: It seems like we had a TODO somewhere, to check for an extension type without having to actually allocate it? Or is this check super cheap and we don't care?

Comment on lines -62 to +66
let Some(typed_value) = shredding_state.typed_value_field() else {
let Some(BorrowedTypedArrayRef {
inner: typed_value_array,
field: _typed_value_field,
}) = shredding_state.typed_value_field()
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, Deref seems like a less invasive option here?

Comment on lines -3162 to +3169
.with_field("metadata", Arc::new(metadata_array), false)
.with_field("value", Arc::new(value_array), true)
.with_field("typed_value", Arc::new(typed_value_struct), true)
.with_column_name("metadata", Arc::new(metadata_array), false)
.with_column_name("value", Arc::new(value_array), true)
.with_column_name("typed_value", Arc::new(typed_value_struct), true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a fair bit of churn from this rename, and I'm not certain the new name is accurate, given that it specifies more than just the column name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually bad to keep the original with_field name, given that its signature nearly matches Field::new? And the new method can keep its (also perfectly descriptive) name with_field_ref?

typed_value: self.typed_value_field(),
typed_value: self
.typed_value_field()
.map(|TypedArrayRef { inner, field }| BorrowedTypedArrayRef { inner, field }),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this merit an impl From?

VariantArray::try_new(&struct_array).expect("should create variant array");

let res = variant_array.try_value(0);
assert!(res.is_err())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assert the error message here?

Comment on lines 429 to 442
// // inspect the typed_value Field and make sure it contains the canonical Uuid extension type
// let typed_value_field = variant_array
// .inner()
// .fields()
// .into_iter()
// .find(|f| f.name() == "typed_value")
// .unwrap();

// assert!(
// typed_value_field
// .try_extension_type::<extension::Uuid>()
// .is_ok()
// );

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to uncomment this check in current PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants